Incorporating linguistic post-processing into whole-book recognition
Abstract
We describe a technique for linguistic post-processing of whole-book recognition results. Whole-book recognition improves the recognition of book images using fully automatic cross-entropy-based model adaptation. In previously published work, each word was recognized in isolation, without regard to passage-level information such as word-occurrence frequencies. As a result, words that are rare in real text may appear much more often in the recognition results, and vice versa. Differences between word frequencies in the recognition results and prior word frequencies may therefore indicate recognition errors over a long passage. In this paper, we propose a post-processing technique that enhances whole-book recognition results by minimizing the differences between these two sets of frequencies. The technique works better on longer passages, and in a 90-page experiment it reduces the character error rate by about 20% in relative terms, from 1.24% to 0.98%.
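The idea of minimizing the gap between recognized and prior word frequencies can be illustrated with a minimal sketch. This is not the authors' algorithm, which the abstract does not specify. It assumes that each word image comes with a list of candidate readings and recognition costs, that the frequency gap is measured as an L1 distance, and that a greedy coordinate search is an acceptable optimizer. The names `frequency_gap`, `objective`, `postprocess`, and `weight` are hypothetical.

```python
from collections import Counter


def frequency_gap(counts, total, prior):
    """L1 distance between observed relative frequencies and prior frequencies."""
    words = set(counts) | set(prior)
    return sum(abs(counts.get(w, 0) / total - prior.get(w, 0.0)) for w in words)


def objective(choice, candidates, prior, weight):
    """Total recognition cost plus a weighted frequency-mismatch penalty."""
    total = len(choice)
    cost = sum(candidates[i][k][1] for i, k in enumerate(choice))
    counts = Counter(candidates[i][k][0] for i, k in enumerate(choice))
    return cost + weight * total * frequency_gap(counts, total, prior)


def postprocess(candidates, prior, weight=1.0, max_sweeps=10):
    """Greedy coordinate search over per-token candidate readings.

    candidates: one list per word image of (word, cost) pairs, lower cost = better.
    prior:      dict mapping word -> prior relative frequency (assumed given).
    """
    # Start from the isolated-word decision: the cheapest candidate per token.
    choice = [min(range(len(c)), key=lambda k: c[k][1]) for c in candidates]
    best = objective(choice, candidates, prior, weight)
    for _ in range(max_sweeps):
        changed = False
        for i, cands in enumerate(candidates):
            for k in range(len(cands)):
                if k == choice[i]:
                    continue
                old = choice[i]
                choice[i] = k
                score = objective(choice, candidates, prior, weight)
                if score < best - 1e-12:
                    best, changed = score, True  # keep the improvement
                else:
                    choice[i] = old  # revert
        if not changed:
            break
    return [candidates[i][k][0] for i, k in enumerate(choice)]


# Toy passage: the second token is slightly more likely "tbe" in isolation,
# but "tbe" never occurs in the prior while "the" should be frequent.
prior = {"the": 0.5, "cat": 0.25, "sat": 0.25}
candidates = [
    [("the", 0.2), ("tbe", 0.9)],
    [("tbe", 0.40), ("the", 0.45)],
    [("cat", 0.1)],
    [("sat", 0.1)],
]
corrected = postprocess(candidates, prior, weight=1.0)
uncorrected = postprocess(candidates, prior, weight=0.0)
```

With `weight=0.0` the sketch falls back to isolated-word decisions. Observed frequencies are estimated from the passage itself, so they become more reliable as the passage grows, which is consistent with the abstract's remark that the technique works better on longer passages.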
Similar resources
Incorporating A Rich Linguistic Model into Whole-Book Recognition
Whole-book recognition, a technique that improves recognition of book images using fully automatic mutual-entropy-based model adaptation, has achieved a character error rate as low as 1.9% on 50 pages of real book images in our previous publications. However, the linguistic model for word recognition was simple, assuming a uniform distribution on the words in the dictionary, so that the algorithm ...
Incorporating linguistic knowledge and automatic baseform generation in acoustic subword unit based speech recognition
A major challenge in speech recognition based on acoustic subword units is creating a lexicon which is robust to inter- and intra-speaker variations. In this paper we present two different approaches for incorporating simple word-level linguistic knowledge into the labelling step of the training procedure. The proposed systems also utilise a scheme for combined optimisation of baseforms and subwor...
Multi-level post-processing for Korean character recognition using morphological analysis and linguistic evaluation
Most post-processing methods for character recognition rely on contextual information at the character and word-fragment levels. However, due to the linguistic characteristics of Korean, such low-level information alone is not sufficient for high-quality character-recognition applications, and much higher-level contextual information is needed to improve the recognition results. This paper present...
Incorporating Cognitive Linguistic Insights into Classrooms: the Case of Iranian Learners’ Acquisition of If-Clauses
Cognitive linguistics gives the most inclusive, consistent description of how language is organized, used and learned to date. Cognitive linguistics contains a great number of concepts that are useful to second language learners. If-clauses in English, on the other hand, remain intriguing for foreign language learners to struggle with, due to their intrinsic intricacies. EFL grammar books are ...
LAperLA: an integrated graphical-linguistic System for old printed Latin Texts
LAperLA (Lettore Automatico per Libri Antichi) is a prototype for the automatic recognition of Latin texts in old printed books. The strengths of the system are the neural architecture and the post-processing linguistic tool that is represented by an index of Latin forms (more than 500,000) and by a query management system which uses the information of the index to check and correct the interpr...